Geospatial data can be expressed in maps, which can offer amazing levels of data density.
You’re probably familiar with JSON, which is frequently used to store and pass data between applications. GeoJSON applies JSON structure to geospatial data in a single JSON file. Let’s get data about child blood lead levels from the City of Philadelphia:
library(rgdal)
philly_lead_map <- readOGR('https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+child_blood_lead_levels_by_ct&filename=child_blood_lead_levels_by_ct&format=geojson&skipfields=cartodb_id')
## OGR data source with driver: GeoJSON
## Source: "https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+child_blood_lead_levels_by_ct&filename=child_blood_lead_levels_by_ct&format=geojson&skipfields=cartodb_id", layer: "OGRGeoJSON"
## with 380 features
## It has 5 fields
Let’s take a peek at the tabular data associated with this geospatial object (not the actual map)
head(philly_lead_map@data)
| census_tract | data_redacted | num_bll_5plus | num_screen | perc_5plus | |
|---|---|---|---|---|---|
| 0 | 42101000100 | 0 | 0 | 100 | 0 |
| 1 | 42101000200 | 1 | NA | 109 | NA |
| 2 | 42101000300 | 1 | NA | 110 | NA |
| 3 | 42101000401 | 1 | NA | 61 | NA |
| 4 | 42101000402 | 0 | 0 | 41 | 0 |
| 5 | 42101000500 | 1 | NA | 49 | NA |
summary(philly_lead_map@data)
| census_tract | data_redacted | num_bll_5plus | num_screen | perc_5plus | |
|---|---|---|---|---|---|
| 42101000100: 1 | Min. :0.0000 | Min. : 0.00 | Min. : 6.0 | Min. : 0.000 | |
| 42101000200: 1 | 1st Qu.:0.0000 | 1st Qu.: 7.00 | 1st Qu.:112.8 | 1st Qu.: 3.225 | |
| 42101000300: 1 | Median :0.0000 | Median :14.00 | Median :204.5 | Median : 5.700 | |
| 42101000401: 1 | Mean :0.3395 | Mean :17.63 | Mean :225.1 | Mean : 5.859 | |
| 42101000402: 1 | 3rd Qu.:1.0000 | 3rd Qu.:26.00 | 3rd Qu.:298.8 | 3rd Qu.: 8.475 | |
| 42101000500: 1 | Max. :1.0000 | Max. :81.00 | Max. :846.0 | Max. :17.600 | |
| (Other) :374 | NA | NA’s :129 | NA’s :8 | NA’s :126 |
And let’s take a quick peek at the map, to make sure there are no initial, obvious problems:
library(leaflet)
library(leaflet.extras)
leaflet(philly_lead_map) %>%
setView(lng = mean(philly_lead_map@bbox['x',], na.rm=TRUE),
lat = mean(philly_lead_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons()
Well, the defaults are ugly, but worse, we see some “holes” in our map. This problem was also evident in the data, which had too few rows.
We’re missing some tracts, because Philly has 384 tracts, and this data only has 380. Clearly some tracts were not surveyed. Let’s get all of Philly’s tracts, since we want to map the entire city, and provide imputation as needed. We can get complete GeoJSON file for all the Census tracts in Philly.
full_philadelphia_map <- readOGR('http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson')
## OGR data source with driver: GeoJSON
## Source: "http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson", layer: "OGRGeoJSON"
## with 384 features
## It has 14 fields
What kind of data is in this map?
head(full_philadelphia_map@data)
| OBJECTID | STATEFP10 | COUNTYFP10 | TRACTCE10 | GEOID10 | NAME10 | NAMELSAD10 | MTFCC10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | LOGRECNO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 42 | 101 | 009400 | 42101009400 | 94 | Census Tract 94 | G5020 | S | 366717 | 0 | +39.9632709 | -075.2322437 | 10429 |
| 1 | 2 | 42 | 101 | 009500 | 42101009500 | 95 | Census Tract 95 | G5020 | S | 319070 | 0 | +39.9658709 | -075.2379140 | 10430 |
| 2 | 3 | 42 | 101 | 009600 | 42101009600 | 96 | Census Tract 96 | G5020 | S | 405273 | 0 | +39.9655396 | -075.2435075 | 10431 |
| 3 | 4 | 42 | 101 | 013800 | 42101013800 | 138 | Census Tract 138 | G5020 | S | 341256 | 0 | +39.9764504 | -075.1771771 | 10468 |
| 4 | 5 | 42 | 101 | 013900 | 42101013900 | 139 | Census Tract 139 | G5020 | S | 562934 | 0 | +39.9750563 | -075.1711846 | 10469 |
| 5 | 6 | 42 | 101 | 014000 | 42101014000 | 140 | Census Tract 140 | G5020 | S | 439802 | 0 | +39.9735358 | -075.1630966 | 10470 |
Oh, okay, it has Census Tract data. Is it complete? Let’s check… does our data frame now have 384 rows?
nrow(full_philadelphia_map@data)
## [1] 384
Let’s do a quick mapping sanity check. We’ll change the default color and line width settings this time around:
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = "white",
fillOpacity = 1,
label = full_philadelphia_map@data$NAMELSAD10
) %>%
suspendScroll() # Handy to stop those accidental "scrolling by" zooms
This map looks great, and the tabular data might prove useful, too, for things like a “prettier” display of the Census Tract name, instead of just a long number. Let’s combine our lead data into this fuller map.
WARNING
merge will reorder your rows, and that breaks the relationship between the rows of tabular data and the corresponding polygons. Not what you want! So we’re going to keep track of the order thanks to the helpful “OBJECTID” variable, which is super helpful. If you’re working with data that doesn’t have a column like this, just add it, so you can keep track of what the original order was.
Merging in this case is pretty simple – we just have to bring in the lead data and make sure our “hinge” (overlapping field) is set up properly:
merged <- merge(x = philly_lead_map@data,
y = full_philadelphia_map@data,
by.x = "census_tract",
by.y = "GEOID10",
all = TRUE)
full_philadelphia_map@data <- merged[order(merged$OBJECTID),]
Let’s see what our lead levels look like without any imputation of missing values:
lead_palette <- colorBin("Blues", domain = full_philadelphia_map@data$perc_5plus, bins = 10, na.color = "#aaaaaa")
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
fillOpacity = 1,
label = full_philadelphia_map@data$NAMELSAD10
) %>%
suspendScroll()
Let’s do some imputation of missing values. We’ll use knn imputation in part. To do that, we first identify the neighbors for each tract:
library(spdep)
coords <- coordinates(full_philadelphia_map)
tracts <- row.names(as(full_philadelphia_map, "data.frame"))
knn10 <- knn2nb(knearneigh(coords, k = 10), row.names = tracts)
knn10 <- include.self(knn10)
Impute data for missing rows.
We have two cases here:
Let’s first think about intelligent imputation on the census tracts that were in the lead data dataset, but with redacted numbers. That’s almost all our missing data.
We know that the actual number of children testing with high blood lead levels in the case of imputation is either 1, 2, 3, 4, or 5 children, so we know the actual possible percentages. So we can do knn imputation, but if the knn imputation comes in lower than the lowest possible percent, we’ll just toss in the lowest possible, and do a similar check with the highest possible percent.
Note that knn averaging will often give us an NA, because there are large areas of Philadelphia with no nearby tracts that were tested. In that case, we can simply impute whatever the percentage is for 3 children testing positive (the median value of all possibilities).
In the case of completely missing data, let’s do a second round of imputation. We’ll let the first round go and fill in the gaps of redacted data. Then, we’ll have good values for neighbors and can do knn imputation.
So, first we’ll handle redacted data:
for (row in 1:nrow(full_philadelphia_map@data)) {
# is this row intentionally redacted?
if (!is.na(full_philadelphia_map@data$data_redacted[row]) &
full_philadelphia_map@data$data_redacted[row] == 1) {
# lowest possible pct
low_pct <- (1/full_philadelphia_map@data$num_screen[row])*100
# highest possible pct
high_pct <- (5/full_philadelphia_map@data$num_screen[row])*100
# the "expected value" or median (and mean) number of kiddos with high lead
median_pct <- (3/full_philadelphia_map@data$num_screen[row])*100
# take the mean of our neighbors
knn_pct <- mean(full_philadelphia_map@data$perc_5plus[unlist(knn10[row])], na.rm=TRUE)
if (is.na(knn_pct)) {
# no way to determine a good guess by neighbors? Choose median.
full_philadelphia_map@data$perc_5plus[row] <- median_pct
}
else if (!is.na(low_pct) & !is.na(knn_pct) & low_pct > knn_pct) {
# knn lower than lowest? Use lowest.
full_philadelphia_map@data$perc_5plus[row] <- low_pct
}
else if (!is.na(high_pct) & !is.na(knn_pct) & high_pct < knn_pct) {
# knn higher than highest? Use highest
full_philadelphia_map@data$perc_5plus[row] <- high_pct
}
else {
# Otherwise, use knn.
full_philadelphia_map@data$perc_5plus[row] <- knn_pct
}
}
}
Now let’s handle the completely missing data:
for (row in 1:nrow(full_philadelphia_map@data)) {
if (is.na(full_philadelphia_map@data$perc_5plus[row])) {
full_philadelphia_map@data$perc_5plus[row] <- mean(full_philadelphia_map@data$perc_5plus[unlist(knn10[row])], na.rm=TRUE)
}
}
jawn
/jôn/
noun DIALECT US
(chiefly in eastern Pennsylvania) used to refer to a thing, place, person, or event that one need not or cannot give a specific name to.
"these jawns are very inexpensive"
lead_palette <- colorBin("Blues", domain = full_philadelphia_map@data$perc_5plus, bins = 10, na.color = "#cccccc")
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
fillOpacity = 1
) %>%
suspendScroll()
And here we want to pull in our local file, which is a simplified version of data from the American Community Survey conducted by the Census Bureau. Let’s see what is contains.
economic_data <- read.csv("../Data/philly_census.csv")
head(economic_data)
| census_tract | pct_unemployed | pct_commute_public_transit | median_household_income | mean_household_income | pct_child_uninsured | pct_families_below_poverty_line |
|---|---|---|---|---|---|---|
| 4.2101e+10 | 2.6 | 33.4 | 103772 | 124525 | 33.9 | 1.7 |
| 4.2101e+10 | 7.6 | 22.0 | 50455 | 99337 | 6.5 | 13.7 |
| 4.2101e+10 | 4.6 | 11.5 | 93036 | 121867 | 3.3 | 3.4 |
| 4.2101e+10 | 5.9 | 21.8 | 57604 | 83354 | 0.0 | 23.2 |
| 4.2101e+10 | 2.0 | 14.6 | 70038 | 92625 | 0.0 | 0.0 |
| 4.2101e+10 | 12.3 | 20.6 | 40568 | 60813 | 9.1 | 0.0 |
This is selected economic characteristics of various census tracts. Let’s combine the data here with our map, and use labels to allow people to understand the data better:
merged_again <- merge(x=full_philadelphia_map@data,
y=economic_data,
by = "census_tract",
all = TRUE)
full_philadelphia_map@data <- merged_again[order(merged_again$OBJECTID),]
str(full_philadelphia_map@data)
## 'data.frame': 384 obs. of 24 variables:
## $ census_tract : Factor w/ 384 levels "42101000100",..: 95 96 97 134 135 136 137 138 139 140 ...
## $ data_redacted : int 0 0 0 0 0 0 1 0 1 0 ...
## $ num_bll_5plus : int 14 11 14 17 14 7 NA 6 NA 13 ...
## $ num_screen : int 306 253 314 157 264 140 125 159 73 204 ...
## $ perc_5plus : num 4.6 4.3 4.5 10.8 5.3 5 4 3.8 5 6.4 ...
## $ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ STATEFP10 : Factor w/ 1 level "42": 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTYFP10 : Factor w/ 1 level "101": 1 1 1 1 1 1 1 1 1 1 ...
## $ TRACTCE10 : Factor w/ 384 levels "000100","000200",..: 95 96 97 134 135 136 137 138 139 140 ...
## $ NAME10 : Factor w/ 384 levels "1","10.01","10.02",..: 369 370 371 43 44 46 47 48 49 50 ...
## $ NAMELSAD10 : Factor w/ 384 levels "Census Tract 1",..: 369 370 371 43 44 46 47 48 49 50 ...
## $ MTFCC10 : Factor w/ 1 level "G5020": 1 1 1 1 1 1 1 1 1 1 ...
## $ FUNCSTAT10 : Factor w/ 1 level "S": 1 1 1 1 1 1 1 1 1 1 ...
## $ ALAND10 : int 366717 319070 405273 341256 562934 439802 562132 789935 570015 609439 ...
## $ AWATER10 : int 0 0 0 0 0 0 0 277434 282808 0 ...
## $ INTPTLAT10 : Factor w/ 384 levels "+39.8798897",..: 106 114 113 141 137 132 126 110 120 131 ...
## $ INTPTLON10 : Factor w/ 384 levels "-074.9667387",..: 350 362 370 257 234 208 168 129 113 135 ...
## $ LOGRECNO : Factor w/ 384 levels "10335","10336",..: 95 96 97 134 135 136 137 138 139 140 ...
## $ pct_unemployed : num 19.2 12.3 15.2 15.8 13.4 11.7 17.9 1.5 6.1 10.7 ...
## $ pct_commute_public_transit : num 49.1 53.2 39.4 19.6 37.8 49 28.9 21.2 21.8 24.9 ...
## $ median_household_income : int 18408 27708 24402 28534 14314 24474 19167 85765 64167 54393 ...
## $ mean_household_income : int 31068 33932 33784 41352 30200 36123 43450 115164 95890 62740 ...
## $ pct_child_uninsured : num 4.6 20 1.2 4.8 1.4 3.7 0 4.4 3.9 9.4 ...
## $ pct_families_below_poverty_line: num 41.5 32.9 39.1 24.2 47.7 33.4 20.5 0 3.7 28.2 ...
Let’s create some useful labels:
labels <- sprintf(
"<strong>%s</strong><br/>
Families Below Poverty Line (%%): %g <br/>
Children With High Blood Lead Levels (%%): %g",
full_philadelphia_map@data$NAMELSAD10,
full_philadelphia_map@data$pct_families_below_poverty_line,
full_philadelphia_map@data$perc_5plus
) %>% lapply(htmltools::HTML)
lead_palette <- colorBin("Blues", domain = full_philadelphia_map@data$perc_5plus, bins = 10, na.color = "#cccccc")
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
fillOpacity = 1,
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto")
) %>%
suspendScroll()
But this doesn’t really show us the impact of poverty overall. Ideally we’d love to be able to have a dual-purpose choropleth. Let’s do that now, using the keyword “group” and some layer controls.
poverty_palette <- colorBin("Reds", domain = full_philadelphia_map@data$pct_families_below_poverty_line, bins = 10, na.color = "#cccccc")
leaflet(full_philadelphia_map) %>%
setView(lng = mean(full_philadelphia_map@bbox['x',], na.rm=TRUE),
lat = mean(full_philadelphia_map@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~lead_palette(full_philadelphia_map@data$perc_5plus),
fillOpacity = 0.5,
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"),
group = "Lead Level"
) %>%
addPolygons(
weight = 1, # border thickness
opacity = 0.5, # border opacity
color = "grey", # border color
fillColor = ~poverty_palette(full_philadelphia_map@data$pct_families_below_poverty_line),
fillOpacity = 1,
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"),
group = "Poverty Level"
) %>%
addLayersControl(
baseGroups = c("Lead Level", "Poverty Level"),
options = layersControlOptions(collapsed = FALSE)
) %>%
suspendScroll()